from IPython.core.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))
This report demonstrates a complete data analysis workflow, including data acquisition, data wrangling, data integration, data persistence, and data analysis. We use Python and other open-source technologies to work with the New York City (NYC) Tree Census 2015 and the San Francisco (SF) Tree Census datasets. The objective of our analysis is to better understand the NYC Tree Census: tree population by borough, overall health status, a visual representation of the surveyed locations, and a comparison of common species with the SF Tree Census.
The report follows this workflow in order. For these purposes, we import Matplotlib, pandas, NumPy, Plotly, Gmap, Seaborn, and several other libraries.
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import re
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from pandas.plotting import scatter_matrix
import seaborn as sns
from pylab import rcParams
import requests
from bs4 import BeautifulSoup
import json
import sqlite3
import datetime as dt
import squarify
import plotly
import plotly.plotly as py
import plotly.offline as offline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objs as go
import ipywidgets as widgets
from scipy import special
import joypy
from mapboxgl.utils import *
from mapboxgl.viz import *
from pylab import imread,subplot,imshow,show
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
%matplotlib inline
We acquired the NYC Tree Census data from the NYC Open Data portal through its API. The dataset contains roughly 700,000 observations, but by default the API allowed us to retrieve only 50,000 at a time. To meet the needs of our analysis, we manually raised the limit to 300,000 observations.
Based on the objective of the project, we acquired the second dataset by downloading the static SF Tree Census file and loading it into our Python environment.
According to the NYC Open Data Source, the Tree Census was "conducted by volunteers and staff organized by NYC Parks & Recreation and partner organizations. Tree data collected includes tree species, diameter and perception of health."
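For reference, the full census (roughly 700,000 rows) could also be retrieved by paging through the API with the SODA `$limit`/`$offset` parameters rather than one large request. A minimal sketch, with helper names of our own choosing and a hypothetical page size:

```python
import requests
import pandas as pd

def soda_page(limit, offset,
              url="https://data.cityofnewyork.us/resource/uvpi-gqnh.json"):
    """Fetch one page of rows from a Socrata (SODA) endpoint."""
    resp = requests.get(url, params={"$limit": limit, "$offset": offset})
    resp.raise_for_status()
    return resp.json()

def fetch_all(get_page, page=50000, max_rows=700000):
    """Page through get_page(limit, offset) until an empty page or max_rows rows."""
    frames, offset = [], 0
    while offset < max_rows:
        batch = get_page(page, offset)
        if not batch:  # empty page -> no more data
            break
        frames.append(pd.DataFrame.from_records(batch))
        offset += page
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# trees = fetch_all(soda_page)  # network call, left commented out here
```

Paging keeps each response small, which avoids the memory problems we hit when requesting everything at once.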
# Limit set to 300,000 rows; requesting the full dataset ran into memory errors
response = requests.get("https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=300000&$offset=0")
print(response) # Check the HTTP status code; with $offset=0 this returns the first 300,000 rows
result = json.loads(response.content)
# Convert the JSON records to a pandas DataFrame
results_df = pd.DataFrame.from_records(result)
results_df.head(5)
<Response [200]>
| address | bbl | bin | block_id | boro_ct | borocode | boroname | brch_light | brch_other | brch_shoe | cb_num | census_tract | cncldist | council_district | created_at | curb_loc | guards | health | latitude | longitude | nta | nta_name | problems | root_grate | root_other | root_stone | sidewalk | spc_common | spc_latin | st_assem | st_senate | state | status | steward | stump_diam | tree_dbh | tree_id | trnk_light | trnk_other | trunk_wire | user_type | x_sp | y_sp | zip_city | zipcode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 108-005 70 AVENUE | 4022210001 | 4052307 | 348711 | 4073900 | 4 | Queens | No | No | No | 406 | 739 | 29 | 29 | 2015-08-27T00:00:00.000 | OnCurb | None | Fair | 40.72309177 | -73.84421522 | QN17 | Forest Hills | None | No | No | No | NoDamage | red maple | Acer rubrum | 28 | 16 | New York | Alive | None | 0 | 3 | 180683 | No | No | No | TreesCount Staff | 1027431.148 | 202756.7687 | Forest Hills | 11375 |
| 1 | 147-074 7 AVENUE | 4044750045 | 4101931 | 315986 | 4097300 | 4 | Queens | No | No | No | 407 | 973 | 19 | 19 | 2015-09-03T00:00:00.000 | OnCurb | None | Fair | 40.79411067 | -73.81867946 | QN49 | Whitestone | Stones | No | No | Yes | Damage | pin oak | Quercus palustris | 27 | 11 | New York | Alive | None | 0 | 21 | 200540 | No | No | No | TreesCount Staff | 1034455.701 | 228644.8374 | Whitestone | 11357 |
| 2 | 390 MORGAN AVENUE | 3028870001 | 3338310 | 218365 | 3044900 | 3 | Brooklyn | No | No | No | 301 | 449 | 34 | 34 | 2015-09-05T00:00:00.000 | OnCurb | None | Good | 40.71758074 | -73.9366077 | BK90 | East Williamsburg | None | No | No | No | Damage | honeylocust | Gleditsia triacanthos var. inermis | 50 | 18 | New York | Alive | 1or2 | 0 | 3 | 204026 | No | No | No | Volunteer | 1001822.831 | 200716.8913 | Brooklyn | 11211 |
| 3 | 1027 GRAND STREET | 3029250001 | 3338342 | 217969 | 3044900 | 3 | Brooklyn | No | No | No | 301 | 449 | 34 | 34 | 2015-09-05T00:00:00.000 | OnCurb | None | Good | 40.71353749 | -73.93445616 | BK90 | East Williamsburg | Stones | No | No | Yes | Damage | honeylocust | Gleditsia triacanthos var. inermis | 53 | 18 | New York | Alive | None | 0 | 10 | 204337 | No | No | No | Volunteer | 1002420.358 | 199244.2531 | Brooklyn | 11211 |
| 4 | 603 6 STREET | 3010850052 | 3025654 | 223043 | 3016500 | 3 | Brooklyn | No | No | No | 306 | 165 | 39 | 39 | 2015-08-30T00:00:00.000 | OnCurb | None | Good | 40.66677776 | -73.97597938 | BK37 | Park Slope-Gowanus | Stones | No | No | Yes | Damage | American linden | Tilia americana | 44 | 21 | New York | Alive | None | 0 | 21 | 189565 | No | No | No | Volunteer | 990913.775 | 182202.426 | Brooklyn | 11215 |
results_df.describe()
| address | bbl | bin | block_id | boro_ct | borocode | boroname | brch_light | brch_other | brch_shoe | cb_num | census_tract | cncldist | council_district | created_at | curb_loc | guards | health | latitude | longitude | nta | nta_name | problems | root_grate | root_other | root_stone | sidewalk | spc_common | spc_latin | st_assem | st_senate | state | status | steward | stump_diam | tree_dbh | tree_id | trnk_light | trnk_other | trunk_wire | user_type | x_sp | y_sp | zip_city | zipcode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 300000 | 296319 | 296319 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 297565 | 300000 | 297565 | 300000 | 300000 | 285601 | 285600 | 300000 | 300000 | 300000 | 300000 | 285583 | 300000 | 300000 | 300000 | 285601 | 285602 | 285602 | 300000 | 300000 | 300000 | 300000 | 285601 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 | 300000 |
| unique | 187703 | 168591 | 173905 | 49950 | 2030 | 5 | 5 | 2 | 2 | 2 | 59 | 1290 | 51 | 51 | 220 | 2 | 4 | 3 | 298478 | 298515 | 188 | 188 | 190 | 2 | 2 | 2 | 2 | 129 | 129 | 65 | 26 | 1 | 3 | 4 | 92 | 112 | 300000 | 2 | 2 | 2 | 3 | 299510 | 299714 | 48 | 187 |
| top | 2750 VETERANS ROAD WEST | 2049050001 | 2000000 | 208115 | 5020801 | 3 | Brooklyn | No | No | No | 503 | 20801 | 51 | 51 | 2015-10-27T00:00:00.000 | OnCurb | None | Good | 40.61097709 | -74.15686098 | BK82 | East New York | None | No | No | No | NoDamage | London planetree | Platanus x acerifolia | 62 | 24 | New York | Alive | None | 0 | 4 | 381746 | No | No | No | NYC Parks Staff | 940697.4014 | 161910.8114 | Brooklyn | 10312 |
| freq | 84 | 121 | 267 | 248 | 3504 | 76380 | 76380 | 272592 | 288690 | 299811 | 29041 | 3343 | 26601 | 26492 | 6769 | 288584 | 249292 | 228272 | 28 | 28 | 8181 | 8181 | 189251 | 298288 | 286858 | 242967 | 205934 | 36919 | 36919 | 25160 | 48146 | 300000 | 285601 | 211750 | 292056 | 27925 | 1 | 299560 | 286365 | 294626 | 114377 | 28 | 28 | 76380 | 11118 |
df_api = results_df
# Remove columns not needed for this analysis
columns_to_drop = [
    "bbl", "st_assem", "st_senate", "bin", "block_id", "boro_ct", "borocode",
    "brch_light", "brch_other", "brch_shoe", "cncldist", "cb_num",
    "census_tract", "council_district", "created_at", "nta", "root_grate",
    "root_other", "root_stone", "sidewalk", "steward", "stump_diam",
    "trnk_light", "trnk_other", "user_type"
]
df_api = df_api.drop(columns_to_drop, axis=1)
df_api.dtypes
address       object
boroname      object
curb_loc      object
guards        object
health        object
latitude      object
longitude     object
nta_name      object
problems      object
spc_common    object
spc_latin     object
state         object
status        object
tree_dbh      object
tree_id       object
trunk_wire    object
x_sp          object
y_sp          object
zip_city      object
zipcode       object
dtype: object
We review the column dtypes to identify candidates for conversion to more appropriate types.
# Define a list of columns to convert and their respective data types
columns_to_convert = {
"latitude": float,
"longitude": float,
"x_sp": float,
"y_sp": float,
"tree_dbh": int,
"zipcode": int
}
# Loop through the columns and apply the specified data type conversion
for column, data_type in columns_to_convert.items():
df_api[column] = df_api[column].astype(data_type)
The selected columns are converted to appropriate types for further analysis.
print(df_api.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 20 columns):
address       300000 non-null object
boroname      300000 non-null object
curb_loc      300000 non-null object
guards        285601 non-null object
health        285600 non-null object
latitude      300000 non-null float64
longitude     300000 non-null float64
nta_name      300000 non-null object
problems      285583 non-null object
spc_common    285602 non-null object
spc_latin     285602 non-null object
state         300000 non-null object
status        300000 non-null object
tree_dbh      300000 non-null int32
tree_id       300000 non-null object
trunk_wire    300000 non-null object
x_sp          300000 non-null float64
y_sp          300000 non-null float64
zip_city      300000 non-null object
zipcode       300000 non-null int32
dtypes: float64(4), int32(2), object(14)
memory usage: 43.5+ MB
None
Based on the general info above, we can conclude that the key columns (coordinates, tree diameter, zipcode, and others) were converted to appropriate types. Some observations have missing values.
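The missing values noted above can be quantified per column before deciding how to handle them. A minimal sketch (the helper name is ours):

```python
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, sorted descending;
    columns with no missing values are omitted."""
    n = df.isna().sum()
    pct = (n / len(df) * 100).round(2)
    out = pd.DataFrame({"missing": n, "pct": pct})
    return out[out["missing"] > 0].sort_values("missing", ascending=False)

# missing_summary(df_api)
```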
In accordance with the project objectives, we acquired the SF Tree Census dataset in CSV format and loaded it into our Python environment.
To ensure the data's relevance, we removed irrelevant columns and verified that the remaining columns are in appropriate formats. The dataset comprises roughly 190,000 observations.
For reference, you can access the dataset at the following link: Street Tree List - https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq
# https://data.sfgov.org/City-Infrastructure/Street-Tree-List/tkzw-k3nq
results_sf = pd.read_csv('street-tree-list.csv', na_values=['???','None','none','null','NULL','nan','NAN','-'])
results_sf.head(5)
| TreeID | qLegalStatus | qSpecies | qAddress | SiteOrder | qSiteInfo | PlantType | qCaretaker | qCareAssistant | PlantDate | DBH | PlotSize | PermitNotes | XCoord | YCoord | Latitude | Longitude | Location | Fire Prevention Districts | Police Districts | Supervisor Districts | Zip Codes | Neighborhoods | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196949 | DPW Maintained | Pyrus calleryana :: Ornamental Pear | 9 Young Ct | 1.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | NaN | 16.0 | Width 3ft | NaN | 6.016267e+06 | 2.096084e+06 | 37.736362 | -122.386202 | {'longitude': '-122.38620200123', 'human_addre... | 10.0 | 3.0 | 8.0 | 58.0 | 1.0 |
| 1 | 203422 | DPW Maintained | Acer rubrum :: Red Maple | 9 Yerba Buena Ave | 1.0 | Sidewalk: Curb side : Yard | Tree | Private | NaN | NaN | 2.0 | Width 4ft | NaN | 5.993355e+06 | 2.097295e+06 | 37.738392 | -122.465507 | {'longitude': '-122.465506999949', 'human_addr... | 1.0 | 8.0 | 4.0 | 59.0 | 40.0 |
| 2 | 115737 | Significant Tree | Acer rubrum :: Red Maple | 9x Yerba Buena Ave | 1.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | 2016-02-24T00:00:00.000 | 3.0 | NaN | Permit Number 776557 | 5.993642e+06 | 2.097056e+06 | 37.737752 | -122.464496 | {'longitude': '-122.46449593033', 'human_addre... | 1.0 | 8.0 | 4.0 | 59.0 | 40.0 |
| 3 | 16476 | Permitted Site | Eucalyptus nicholii :: Nichol's Willow-Leafed ... | 9X Newhall St | 10.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | NaN | 12.0 | 3X3 | NaN | 6.018697e+06 | 2.097076e+06 | 37.739219 | -122.377869 | {'longitude': '-122.377869364283', 'human_addr... | 10.0 | 3.0 | 8.0 | 58.0 | 1.0 |
| 4 | 16473 | DPW Maintained | Eucalyptus nicholii :: Nichol's Willow-Leafed ... | 9X Newhall St | 7.0 | Sidewalk: Curb side : Cutout | Tree | DPW | NaN | NaN | 12.0 | 3X3 | NaN | 6.018697e+06 | 2.097076e+06 | 37.739219 | -122.377869 | {'longitude': '-122.377869364283', 'human_addr... | 10.0 | 3.0 | 8.0 | 58.0 | 1.0 |
# Remove irrelevant columns from the DataFrame based on the previous step
columns_to_drop = [
"qLegalStatus", "SiteOrder", "PlantType", "qCaretaker",
"qCareAssistant", "PlantDate", "PlotSize", "PermitNotes",
"XCoord", "YCoord", "Location", "Fire Prevention Districts",
"Police Districts", "Supervisor Districts", "Zip Codes", "Neighborhoods"
]
test_sf = results_sf.drop(columns_to_drop, axis=1)
test_sf.tail()
| TreeID | qSpecies | qAddress | qSiteInfo | DBH | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|
| 191529 | 48137 | Tree(s) :: | NaN | Sidewalk: Curb side : Cutout | NaN | NaN | NaN |
| 191530 | 48342 | Tree(s) :: | NaN | Sidewalk: Curb side : Cutout | NaN | NaN | NaN |
| 191531 | 48276 | Tree(s) :: | NaN | Sidewalk: Curb side : Cutout | NaN | NaN | NaN |
| 191532 | 46884 | Tree(s) :: | NaN | Sidewalk: Curb side : Cutout | NaN | NaN | NaN |
| 191533 | 35284 | Olea europaea :: Olive Tree | NaN | Sidewalk: Curb side : Cutout | NaN | NaN | NaN |
print(test_sf.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 191534 entries, 0 to 191533
Data columns (total 7 columns):
TreeID       191534 non-null int64
qSpecies     191534 non-null object
qAddress     190044 non-null object
qSiteInfo    191534 non-null object
DBH          150299 non-null float64
Latitude     188869 non-null float64
Longitude    188869 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 10.2+ MB
None
Upon inspecting the DataFrame, we observed that the species column (qSpecies) contains both Latin and common names, separated by "::". Given the non-research nature of this report, we decided to retain the common names of the tree species.
During the inspection of the dataset, it was noted that some species have numerous variations under one overarching species, such as "Skeena Cherries," "Bing Cherries," "Lapin Cherries," and others. For the specific purpose of this analysis, it was decided to consolidate these variations into a single common name. For example, all variations of cherries were renamed to "Cherry."
test_sf_info = test_sf.groupby('qSpecies')['TreeID'].nunique().nlargest(35)
test_sf_info
qSpecies
Tree(s) ::                                                                   11597
Platanus x hispanica :: Sycamore: London Plane                               11499
Metrosideros excelsa :: New Zealand Xmas Tree                                 8704
Lophostemon confertus :: Brisbane Box                                         8519
Pittosporum undulatum :: Victorian Box                                        7097
Tristaniopsis laurina :: Swamp Myrtle                                         7007
Prunus cerasifera :: Cherry Plum                                              6702
Magnolia grandiflora :: Southern Magnolia                                     6272
Arbutus 'Marina' :: Hybrid Strawberry Tree                                    5652
Ficus microcarpa nitida 'Green Gem' :: Indian Laurel Fig Tree 'Green Gem'     5641
Prunus serrulata 'Kwanzan' :: Kwanzan Flowering Cherry                        4028
Acacia melanoxylon :: Blackwood Acacia                                        3956
Maytenus boaria :: Mayten                                                     3895
Olea europaea :: Olive Tree                                                   3682
Corymbia ficifolia :: Red Flowering Gum                                       3565
Callistemon citrinus :: Lemon Bottlebrush                                     3267
Ginkgo biloba :: Maidenhair Tree                                              3192
Pyrus calleryana :: Ornamental Pear                                           2981
Prunus serrulata :: Ornamental Cherry                                         2674
Ulmus parvifolia :: Chinese Elm                                               2348
Eriobotrya deflexa :: Bronze Loquat                                           2345
Pinus radiata :: Monterey Pine                                                2229
Ligustrum lucidum :: Glossy Privet                                            2194
Pyrus kawakamii :: Evergreen Pear                                             1948
::                                                                            1898
Cupressus macrocarpa :: Monterey Cypress                                      1886
Pittosporum crassifolium :: Karo Tree                                         1792
Tristaniopsis laurina 'Elegant' :: Small-leaf Tristania 'Elegant'             1783
Melaleuca quinquenervia :: Cajeput                                            1703
Cordyline australis :: Dracena Palm                                           1624
Ficus nitida :: Laurel Fig                                                    1612
Myoporum laetum :: Myoporum                                                   1600
Liquidambar styraciflua :: American Sweet Gum                                 1514
Ficus retusa nitida :: Banyan Fig                                             1417
Juniperus chinensis :: Juniper                                                1405
Name: TreeID, dtype: int64
As the output shows, the species column (qSpecies) contains many unidentified trees, recorded as "Tree(s) ::" or "::". Given the nature of this exercise, we decided to drop trees with unidentified names and focus only on common names. To achieve this, a helper function is applied with applymap to extract the common names; unidentified entries are then dropped. This keeps the analysis focused on recognizable, informative data.
# Define a function to extract the common name from a species entry
def namechange(x):
    # Strings like "Latin name :: Common name" keep only the part after '::'
    if isinstance(x, str) and '::' in x:
        return re.sub('.+::', '', x)
    return x
# Apply the namechange function to the entire DataFrame, focusing on the "qSpecies" column
test_sf = test_sf.applymap(namechange)
# Resources used:
# -GeeksforGeeks: https://www.geeksforgeeks.org/python-pandas-dataframe-applymap/
# - Stack Overflow: https://stackoverflow.com/questions/48819767/pandas-dataframe-apply-lambda-if-else-erro
# - Stack Overflow: https://stackoverflow.com/questions/14029245/putting-an-if-elif-else-statement-on-one-line/14029436
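As an aside, the same extraction can be done with pandas string methods on the one column that needs it, rather than element-wise over the whole DataFrame. A sketch (the helper name is ours; note that `.str.strip()` also trims surrounding whitespace, which the regex above leaves in place, and that empty results are converted to NaN in the same step):

```python
import pandas as pd
import numpy as np

def common_names(species: pd.Series) -> pd.Series:
    """Keep the text after the last '::' (the common name), strip whitespace,
    and turn empty results (unidentified trees) into NaN."""
    out = species.str.split("::").str[-1].str.strip()
    return out.replace("", np.nan)

# test_sf["qSpecies"] = common_names(test_sf["qSpecies"])
```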
test_sf.head(5)
| TreeID | qSpecies | qAddress | qSiteInfo | DBH | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|
| 0 | 196949 | Ornamental Pear | 9 Young Ct | Sidewalk: Curb side : Cutout | 16.0 | 37.736362 | -122.386202 |
| 1 | 203422 | Red Maple | 9 Yerba Buena Ave | Sidewalk: Curb side : Yard | 2.0 | 37.738392 | -122.465507 |
| 2 | 115737 | Red Maple | 9x Yerba Buena Ave | Sidewalk: Curb side : Cutout | 3.0 | 37.737752 | -122.464496 |
| 3 | 16476 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 4 | 16473 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
# Replace empty strings in the 'qSpecies' column with NaN
test_sf['qSpecies'].replace('', np.nan, inplace=True)
# Drop rows with NaN values in the 'qSpecies' column
test_sf.dropna(subset=['qSpecies'], inplace=True)
test_sf.head(20)
| TreeID | qSpecies | qAddress | qSiteInfo | DBH | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|
| 0 | 196949 | Ornamental Pear | 9 Young Ct | Sidewalk: Curb side : Cutout | 16.0 | 37.736362 | -122.386202 |
| 1 | 203422 | Red Maple | 9 Yerba Buena Ave | Sidewalk: Curb side : Yard | 2.0 | 37.738392 | -122.465507 |
| 2 | 115737 | Red Maple | 9x Yerba Buena Ave | Sidewalk: Curb side : Cutout | 3.0 | 37.737752 | -122.464496 |
| 3 | 16476 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 4 | 16473 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 5 | 16470 | Red Ironbark | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 6 | 16475 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 18.0 | 37.739219 | -122.377869 |
| 7 | 16478 | Flaxleaf Paperbark | 9X Newhall St | Sidewalk: Curb side : Cutout | 6.0 | 37.739219 | -122.377869 |
| 8 | 16472 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 18.0 | 37.739219 | -122.377869 |
| 9 | 16471 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 3.0 | 37.739219 | -122.377869 |
| 10 | 13304 | Red Flowering Gum | 9X Lyell St | Median : Cutout | 12.0 | 37.731771 | -122.431676 |
| 11 | 13302 | Red Flowering Gum | 9X Lyell St | Median : Cutout | 12.0 | 37.731771 | -122.431676 |
| 12 | 13301 | Red Flowering Gum | 9X Lyell St | Median : Cutout | 18.0 | 37.731771 | -122.431676 |
| 13 | 13303 | Red Flowering Gum | 9X Lyell St | Median : Cutout | 18.0 | 37.731771 | -122.431676 |
| 14 | 102172 | Kwanzan Flowering Cherry | 9X Hattie St | Sidewalk: Curb side : Cutout | 4.0 | 37.761558 | -122.440979 |
| 15 | 102171 | Kwanzan Flowering Cherry | 9X Hattie St | Sidewalk: Curb side : Cutout | 4.0 | 37.761558 | -122.440979 |
| 16 | 12977 | Mayten | 9X Castro St | Sidewalk: Curb side : Cutout | 4.0 | 37.769905 | -122.436179 |
| 17 | 116748 | Swamp Myrtle | 9x BLANKEN AVE | Sidewalk: Curb side : Cutout | NaN | 37.712291 | -122.401484 |
| 18 | 116746 | Swamp Myrtle | 9x BLANKEN AVE | Sidewalk: Curb side : Cutout | NaN | 37.712291 | -122.401484 |
| 19 | 116747 | Swamp Myrtle | 9x BLANKEN AVE | Sidewalk: Curb side : Cutout | NaN | 37.712291 | -122.401484 |
The steps above replace empty strings ('') in the "qSpecies" column with NaN and then drop the rows with NaN values from the dataset.
# NaN cleaned dataset renamed
fixed_sf = test_sf
# Define a function to standardize tree species names
def treename(x):
if type(x)==str:
if ' Cherry Plum' in x:
y = 'purple-leaf plum'
elif 'Cherry' in x:
y= 'cherry'
elif 'Sycamore: London Plane' in x:
y= 'London planetree'
elif 'Southern Magnolia' in x:
y= 'southern magnolia'
else:
y=x
else:
y=x
return y
# Apply the treename function to the entire DataFrame, standardizing tree species names
fixed_sf = fixed_sf.applymap(treename)
#Resources used:
# - GeeksforGeeks: https://www.geeksforgeeks.org/python-pandas-dataframe-applymap/
# - Stack Overflow: https://stackoverflow.com/questions/48819767/pandas-dataframe-apply-lambda-if-else-erro
# - Stack Overflow: https://stackoverflow.com/questions/14029245/putting-an-if-elif-else-statement-on-one-line/14029436
In this section, a function named treename is defined and applied to the DataFrame fixed_sf. Its purpose is to standardize tree species names by mapping different variations onto common names, which simplifies further analysis. Given the non-scientific nature of this report, we use common names throughout. This resolves the high variability in some species; for example, every variation of 'Cherry' is now uniformly represented as 'cherry'.
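An equivalent table-driven version of this mapping (a sketch): `Series.replace` with a regex dictionary, applied in insertion order. As in the if/elif chain above, the more specific 'Cherry Plum' pattern must precede the generic 'Cherry' pattern.

```python
import pandas as pd

# Order matters: specific patterns before generic ones.
SPECIES_MAP = {
    r".*Cherry Plum.*": "purple-leaf plum",
    r".*Cherry.*": "cherry",
    r".*Sycamore: London Plane.*": "London planetree",
    r".*Southern Magnolia.*": "southern magnolia",
}

def standardize(species: pd.Series) -> pd.Series:
    """Map known species-name variations onto a single common name."""
    return species.replace(SPECIES_MAP, regex=True)

# fixed_sf["qSpecies"] = standardize(fixed_sf["qSpecies"])
```

A dictionary like this is easier to extend than a growing if/elif chain when new variations turn up.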
fixed_sf.head(10)
| TreeID | qSpecies | qAddress | qSiteInfo | DBH | Latitude | Longitude | |
|---|---|---|---|---|---|---|---|
| 0 | 196949 | Ornamental Pear | 9 Young Ct | Sidewalk: Curb side : Cutout | 16.0 | 37.736362 | -122.386202 |
| 1 | 203422 | Red Maple | 9 Yerba Buena Ave | Sidewalk: Curb side : Yard | 2.0 | 37.738392 | -122.465507 |
| 2 | 115737 | Red Maple | 9x Yerba Buena Ave | Sidewalk: Curb side : Cutout | 3.0 | 37.737752 | -122.464496 |
| 3 | 16476 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 4 | 16473 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 5 | 16470 | Red Ironbark | 9X Newhall St | Sidewalk: Curb side : Cutout | 12.0 | 37.739219 | -122.377869 |
| 6 | 16475 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 18.0 | 37.739219 | -122.377869 |
| 7 | 16478 | Flaxleaf Paperbark | 9X Newhall St | Sidewalk: Curb side : Cutout | 6.0 | 37.739219 | -122.377869 |
| 8 | 16472 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 18.0 | 37.739219 | -122.377869 |
| 9 | 16471 | Nichol's Willow-Leafed Peppermint | 9X Newhall St | Sidewalk: Curb side : Cutout | 3.0 | 37.739219 | -122.377869 |
For our project, we randomly sampled roughly ten thousand observations from each dataset. We merged these two samples and reconciled column names where needed. Our goal was to identify tree species that share common names in both regions.
# Select a random sample of 10,000 rows from the 'fixed_sf' DataFrame
random_sf = fixed_sf.sample(n=10000)
print(random_sf.head())
TreeID qSpecies qAddress \
73862 220960 Lemon Bottlebrush 3320 21st St
133934 20830 Monterey Cypress 1801X Sunset Blvd
155020 125028 Blackwood Acacia 1454 Jefferson St
55645 104347 London planetree 417 31st Ave
138643 52295 Mayten 170 Eureka St
qSiteInfo DBH Latitude Longitude
73862 Sidewalk: Curb side : Cutout 15.0 37.756928 -122.421665
133934 Sidewalk: Curb side : Cutout 42.0 37.753398 -122.495116
155020 Sidewalk: Curb side : Cutout 8.0 37.805625 -122.436531
55645 Sidewalk: Curb side : Cutout 13.0 37.781347 -122.491355
138643 Sidewalk: Curb side : Cutout NaN 37.759565 -122.438248
# Select a random sample of 10,000 rows from the 'df_api' DataFrame
random_api = df_api.sample(n=10000)
print(random_api.head())
address boroname curb_loc guards health \
218816 504 WEST 180 STREET Manhattan OnCurb None Good
191468 50-050 193 STREET Queens OnCurb None Good
236046 600 HYLAN BOULEVARD Staten Island OffsetFromCurb None Good
57450 231-021 128 AVENUE Queens OnCurb None Good
279599 35 ROSE AVENUE Staten Island OnCurb None Fair
latitude longitude nta_name problems \
218816 40.847673 -73.932132 Washington Heights South None
191468 40.746803 -73.785647 Auburndale None
236046 40.605185 -74.076844 Grasmere-Arrochar-Ft. Wadsworth None
57450 40.683267 -73.733852 Laurelton None
279599 40.574779 -74.120321 New Dorp-Midland Beach Stones
spc_common spc_latin state status tree_dbh \
218816 littleleaf linden Tilia cordata New York Alive 4
191468 London planetree Platanus x acerifolia New York Alive 31
236046 maple Acer New York Alive 11
57450 pin oak Quercus palustris New York Alive 8
279599 Norway maple Acer platanoides New York Alive 13
tree_id trunk_wire x_sp y_sp zip_city \
218816 409240 No 1.003027e+06 248114.7522 New York
191468 397496 No 1.043644e+06 211429.7232 Fresh Meadows
236046 432810 No 9.629124e+05 159770.8787 Staten Island
57450 258271 No 1.058066e+06 188321.0217 Springfield Gardens
279599 459385 No 9.508247e+05 148707.0389 Staten Island
zipcode
218816 10033
191468 11365
236046 10305
57450 11413
279599 10306
Considering the size of both data frames, ten thousand observations were randomly selected from each for analysis purposes.
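One note on reproducibility: `DataFrame.sample` draws a different subset on every run, so re-executing this notebook will shift the counts below slightly. Passing `random_state` pins the draw; a sketch (the helper name and seed are ours):

```python
import pandas as pd

def reproducible_sample(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Draw n rows without replacement; the seed fixes the subset across runs."""
    return df.sample(n=n, random_state=seed)

# random_sf = reproducible_sample(fixed_sf, 10000)
# random_api = reproducible_sample(df_api, 10000)
```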
random_sf.describe()
| TreeID | DBH | Latitude | Longitude | |
|---|---|---|---|---|
| count | 10000.000000 | 8266.000000 | 9919.000000 | 9919.000000 |
| mean | 129445.067300 | 10.220421 | 37.763637 | -122.440801 |
| std | 79260.916592 | 23.347985 | 0.192503 | 0.319581 |
| min | 2.000000 | 0.000000 | 37.708340 | -138.283252 |
| 25% | 52188.750000 | 3.000000 | 37.740248 | -122.453560 |
| 50% | 126025.000000 | 7.000000 | 37.760086 | -122.430999 |
| 75% | 204253.000000 | 12.000000 | 37.779924 | -122.412841 |
| max | 258218.000000 | 1920.000000 | 47.270219 | -122.368426 |
# Rename columns in the 'random_sf' DataFrame for clarity and consistency
random_sf = random_sf.rename(columns={
    'qSpecies': 'spc_common',
    'Latitude': 'Latitude_SF',
    'Longitude': 'Longitude_SF',
})
Columns are renamed to a common format so the two datasets can be merged; the merge is conducted on the species common name.
# Group the most common tree species ('spc_common') and count the unique 'TreeID' values for each group.
# Then, select the top 10 groups with the largest counts.
ln_sf = random_sf.groupby('spc_common')['TreeID'].nunique().nlargest(10)
ln_sf
#Resources used:
# - Stack Overflow: https://stackoverflow.com/questions/35364601/group-by-and-find-top-n-value-counts-pandas
spc_common
London planetree                      660
New Zealand Xmas Tree                 487
cherry                                451
Brisbane Box                          446
Victorian Box                         411
purple-leaf plum                      388
Swamp Myrtle                          382
southern magnolia                     374
Hybrid Strawberry Tree                338
Indian Laurel Fig Tree 'Green Gem'    310
Name: TreeID, dtype: int64
# Merge of the NYC and SF tree census datasets ('random_sf' and 'random_api' DataFrames) based on the 'spc_common' column using an inner join.
# This combines data where the 'spc_common' values match in both DataFrames.
merged = pd.merge(random_sf, random_api, on=['spc_common'], how='inner')
merged.head()
| TreeID | spc_common | qAddress | qSiteInfo | DBH | Latitude_SF | Longitude_SF | address | boroname | curb_loc | guards | health | latitude | longitude | nta_name | problems | spc_latin | state | status | tree_dbh | tree_id | trunk_wire | x_sp | y_sp | zip_city | zipcode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 50-050 193 STREET | Queens | OnCurb | None | Good | 40.746803 | -73.785647 | Auburndale | None | Platanus x acerifolia | New York | Alive | 31 | 397496 | No | 1.043644e+06 | 211429.7232 | Fresh Meadows | 11365 |
| 1 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 25 PINE TERRACE | Staten Island | OnCurb | None | Good | 40.537586 | -74.177937 | Annadale-Huguenot-Prince's Bay-Eltingville | None | Platanus x acerifolia | New York | Alive | 12 | 380728 | No | 9.347914e+05 | 135183.8448 | Staten Island | 10312 |
| 2 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 1445 LONGFELLOW AVENUE | Bronx | OnCurb | None | Fair | 40.830801 | -73.887539 | Crotona Park East | Stones | Platanus x acerifolia | New York | Alive | 13 | 493987 | No | 1.015372e+06 | 241980.5482 | Bronx | 10459 |
| 3 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 1883 VYSE AVENUE | Bronx | OnCurb | None | Fair | 40.839751 | -73.884283 | East Tremont | None | Platanus x acerifolia | New York | Alive | 11 | 441819 | No | 1.016269e+06 | 245242.5014 | Bronx | 10460 |
| 4 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 837 CARROLL STREET | Brooklyn | OnCurb | None | Good | 40.671808 | -73.972644 | Park Slope-Gowanus | None | Platanus x acerifolia | New York | Alive | 18 | 207864 | No | 9.918386e+05 | 184035.3164 | Brooklyn | 11215 |
merged.to_csv("mergedDF1.csv", index=False)
Data set merged and saved as a CSV file. Given the size of the full data set from the NYC Tree Census API, it may not be practical to save it in a CSV file for further analysis.
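One caveat on the inner join above: `spc_common` is not unique in either sample, so the merge forms a Cartesian product within each shared species and inflates the row count. If the question is only which species appear in both cities, a set intersection avoids that blow-up; a sketch (the helper name is ours):

```python
import pandas as pd

def shared_species(a: pd.Series, b: pd.Series) -> set:
    """Species common names present in both samples (case-insensitive)."""
    return set(a.dropna().str.lower()) & set(b.dropna().str.lower())

# shared_species(random_sf["spc_common"], random_api["spc_common"])
```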
# Group the 'merged' DataFrame by tree species ('spc_common') and count the unique 'tree_id' values for each group.
m_sfny = merged.groupby('spc_common')['tree_id'].nunique()
m_sfny
#Source used:
#https://datascience.stackexchange.com/questions/33053/how-do-i-compare-columns-in-different-data-frames
spc_common
London planetree     1246
cherry                441
purple-leaf plum      112
southern magnolia       3
Name: tree_id, dtype: int64
# Convert column names to lowercase for consistency
merged.columns = map(str.lower, merged.columns)
# Transform the 'spc_common' column to lowercase (note: .str.lower() returns a new Series, so assign it back)
merged['spc_common'] = merged['spc_common'].str.lower()
# Display the first few rows of the DataFrame to check the changes
merged.head()
| treeid | spc_common | qaddress | qsiteinfo | dbh | latitude_sf | longitude_sf | address | boroname | curb_loc | guards | health | latitude | longitude | nta_name | problems | spc_latin | state | status | tree_dbh | tree_id | trunk_wire | x_sp | y_sp | zip_city | zipcode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 50-050 193 STREET | Queens | OnCurb | None | Good | 40.746803 | -73.785647 | Auburndale | None | Platanus x acerifolia | New York | Alive | 31 | 397496 | No | 1.043644e+06 | 211429.7232 | Fresh Meadows | 11365 |
| 1 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 25 PINE TERRACE | Staten Island | OnCurb | None | Good | 40.537586 | -74.177937 | Annadale-Huguenot-Prince's Bay-Eltingville | None | Platanus x acerifolia | New York | Alive | 12 | 380728 | No | 9.347914e+05 | 135183.8448 | Staten Island | 10312 |
| 2 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 1445 LONGFELLOW AVENUE | Bronx | OnCurb | None | Fair | 40.830801 | -73.887539 | Crotona Park East | Stones | Platanus x acerifolia | New York | Alive | 13 | 493987 | No | 1.015372e+06 | 241980.5482 | Bronx | 10459 |
| 3 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 1883 VYSE AVENUE | Bronx | OnCurb | None | Fair | 40.839751 | -73.884283 | East Tremont | None | Platanus x acerifolia | New York | Alive | 11 | 441819 | No | 1.016269e+06 | 245242.5014 | Bronx | 10460 |
| 4 | 104347 | London planetree | 417 31st Ave | Sidewalk: Curb side : Cutout | 13.0 | 37.781347 | -122.491355 | 837 CARROLL STREET | Brooklyn | OnCurb | None | Good | 40.671808 | -73.972644 | Park Slope-Gowanus | None | Platanus x acerifolia | New York | Alive | 18 | 207864 | No | 9.918386e+05 | 184035.3164 | Brooklyn | 11215 |
# Create a new dataset (DataFrame 'm_ns') by selecting specific columns from the merged dataset and renaming them for clarity.
m_ns = pd.DataFrame(merged, columns=['spc_common', 'tree_dbh', 'dbh', 'tree_id', 'treeid'])
m_ns = m_ns.rename(columns={'tree_dbh': 'nyc_dbh',
                            'dbh': 'sf_dbh',
                            'tree_id': 'nyc_tree_id',
                            'treeid': 'sf_tree_id'})
# Transform the 'spc_common' column to lowercase for consistency
m_ns['spc_common'] = m_ns['spc_common'].str.lower()
m_ns.dtypes
spc_common      object
nyc_dbh          int32
sf_dbh         float64
nyc_tree_id     object
sf_tree_id       int64
dtype: object
# Convert the 'nyc_dbh' column to a float data type for precise decimal representation
m_ns["nyc_dbh"] = m_ns.nyc_dbh.astype(float)
# Convert the 'nyc_tree_id' column to an integer data type for whole number representation
m_ns["nyc_tree_id"] = m_ns.nyc_tree_id.astype(int)
To gain a better understanding of representative samples from both cities, we used the 'groupby' function and exploratory statistics to analyze common tree species. Our analysis focused primarily on Diameter at Breast Height (DBH), the standard measurement of a tree trunk's diameter, taken at roughly 1.3 meters (4.5 feet) above ground level. This approach allowed us to uncover similarities between the two cities in terms of tree species and their characteristics.
# Group the data in the dataset ('m_ns' DataFrame) by tree species ('spc_common') and calculate statistics on 'sf_dbh' - tree species diameter.
# The statistics include the minimum, maximum, and mean values for each tree species.
m_ns.groupby('spc_common').sf_dbh.agg(['min','max','mean']).round()
| min | max | mean | |
|---|---|---|---|
| spc_common | |||
| cherry | 1.0 | 42.0 | 5.0 |
| london planetree | 1.0 | 1920.0 | 16.0 |
| purple-leaf plum | 0.0 | 22.0 | 6.0 |
| southern magnolia | 0.0 | 80.0 | 7.0 |
Exploratory statistics on DBH of sample drawn from SF dataset.
m_ns.groupby('spc_common').nyc_dbh.agg(['min','max','mean']).round()
| min | max | mean | |
|---|---|---|---|
| spc_common | |||
| cherry | 0.0 | 23.0 | 5.0 |
| london planetree | 2.0 | 255.0 | 21.0 |
| purple-leaf plum | 1.0 | 13.0 | 5.0 |
| southern magnolia | 2.0 | 16.0 | 7.0 |
Exploratory statistics on DBH of sample drawn from NYC dataset.
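The two per-city tables above can also be produced in one step by aggregating both DBH columns together. A minimal sketch, using a toy `m_ns_small` frame standing in for the real `m_ns`:

```python
import pandas as pd

# Toy stand-in for the 'm_ns' frame with both cities' DBH columns.
m_ns_small = pd.DataFrame({
    "spc_common": ["cherry", "cherry", "london planetree", "london planetree"],
    "sf_dbh":  [4.0, 6.0, 14.0, 18.0],
    "nyc_dbh": [5.0, 5.0, 20.0, 22.0],
})

# A single agg call over both columns puts the two cities' statistics
# side by side, instead of two separate groupby calls.
side_by_side = m_ns_small.groupby("spc_common")[["sf_dbh", "nyc_dbh"]].agg(
    ["min", "max", "mean"]
)
```

The result carries a two-level column index (city, statistic), which makes the SF/NYC comparison readable in a single table.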
# Group the merged dataset by tree species ('spc_common') and count the unique 'nyc_tree_id' and 'sf_tree_id' values (common trees) for each species in both cities.
m_ns.groupby('spc_common')[['nyc_tree_id', 'sf_tree_id']].nunique()
| nyc_tree_id | sf_tree_id | |
|---|---|---|
| spc_common | ||
| cherry | 441 | 451 |
| london planetree | 1246 | 660 |
| purple-leaf plum | 112 | 388 |
| southern magnolia | 3 | 374 |
Upon comparing prevalence of similar species in merged dataset, we observed that the London Plane tree is notably more common in New York City (NYC). It's worth noting that this tree species is so abundant in NYC that it falls under the NYC Parks Department's list of restricted-use species for street tree planting, as it constitutes more than 10% of all street trees in the city.
On the other hand, the Purple Leaf Plum tree, originally from Southeast Europe and known for its adaptability to milder climates, is more populous in San Francisco (SF). This phenomenon is likely attributed to the similarity in climate to its region of origin.
Additionally, when examining the average diameter at breast height, it became evident that London Plane trees in NYC are generally larger. This size difference is likely due to climate conditions, as the London Plane tree is native to the Northern Hemisphere.
In summary, this analysis highlights the differing abundance and characteristics of these tree species in NYC and SF, which can be attributed to climate and historical factors.
Please see below for a graphical representation of the common species between the two datasets.
Source retrieved from : https://en.wikipedia.org/wiki/Platanus_×_acerifolia / https://en.wikipedia.org/wiki/Cherry_plum
# Group the data in the dataset ('m_ns' DataFrame) by tree species ('spc_common') and count the unique 'nyc_tree_id' and 'sf_tree_id' values for each species for further analysis.
m_plot = m_ns.groupby('spc_common')[['nyc_tree_id', 'sf_tree_id']].nunique()
# Adjust the figure size and font size for the plot
rcParams['figure.figsize'] = 18, 8
rcParams['font.size'] = 25
# Aesthetics: plot style set to 'ggplot'
plt.style.use('ggplot')
# Bar plot to compare the number of unique tree identifiers for NYC and SF, using different colors for each city
m_plot.loc[:,['nyc_tree_id','sf_tree_id']].plot(kind='bar', color=['#bc8aad', '#F26458'])
plt.title('NYC vs SF trees', fontsize=15)  # note: alpha must lie in [0, 1], so the invalid alpha=1.75 was dropped
Text(0.5, 1.0, 'NYC vs SF trees')
The code below prepares and visualizes data to present a treemap illustrating the tree population across different NYC boroughs.
# Preparing Data: Grouping tree data by borough and calculating counts
df = df_api.groupby('boroname').size().reset_index(name='counts')
# Creating labels for boroughs with counts
labels = df.apply(lambda x: str(x['boroname']) + "\n (" + str(x['counts']) + ")", axis=1)
# Extracting count values and defining colors for the treemap
sizes = df['counts'].values.tolist()
colors = [plt.cm.Spectral(i/float(len(labels))) for i in range(len(labels))]
# Creating the treemap visualization
plt.figure(figsize=(12,8), dpi= 80)
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=.8)
# References used:
# https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
# Visualization Type: Treemap
<matplotlib.axes._subplots.AxesSubplot at 0x20d81fc16d8>
As the treemap above shows, Queens and Brooklyn have the most trees, together accounting for about 51% of the observed trees in our sample. Manhattan, despite being the third most populous borough with over 1.6 million residents, has the fewest tree observations; this may reflect its high density of buildings. For comparison, Brooklyn is the most populous borough with about 2.5 million residents, and Queens is second with around 2.3 million.
You can find resident data for 2018 in this source: https://www.citypopulation.de/php/usa-newyorkcity.php
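The ~51% Queens-plus-Brooklyn share quoted above can be verified directly from the grouped counts. A minimal sketch, with `df_api_small` as a toy stand-in for the real ~300,000-row `df_api` (so the numbers here are illustrative, not the report's):

```python
import pandas as pd

# Toy stand-in for df_api's 'boroname' column; the real shares come from
# the ~300,000-row API sample.
df_api_small = pd.DataFrame({
    "boroname": ["Queens"] * 5 + ["Brooklyn"] * 4 + ["Bronx"] * 3 +
                ["Staten Island"] * 3 + ["Manhattan"] * 2,
})

# Per-borough counts converted to percentage shares of the total.
counts = df_api_small.groupby("boroname").size()
share = counts / counts.sum() * 100

# Combined Queens + Brooklyn share (about 51% in the full sample).
qb_share = share.loc[["Queens", "Brooklyn"]].sum()
```

Applied to the full `df_api`, the same three lines reproduce the treemap's proportions numerically.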
# Group the 'df_api' DataFrame by tree species ('spc_common') and count the number of unique 'tree_id' values for each species.
df_aw = df_api.groupby('spc_common')['tree_id'].nunique().nlargest(30)
df_aw
spc_common
London planetree          36919
honeylocust               27956
Callery pear              27478
pin oak                   21437
Norway maple              15567
cherry                    13077
Japanese zelkova          12781
littleleaf linden         12321
ginkgo                     9241
Sophora                    8733
red maple                  8660
green ash                  6559
American linden            6141
silver maple               5266
sweetgum                   4890
American elm               3579
northern red oak           3506
purple-leaf plum           3349
swamp white oak            2739
Chinese elm                2668
maple                      2611
silver linden              2484
'Schubert' chokecherry     2349
crimson king maple         2304
Japanese tree lilac        2036
golden raintree            1940
eastern redbud             1896
crab apple                 1751
willow oak                 1513
hedge maple                1503
Name: tree_id, dtype: int64
For further analysis, the table above lists the counts of the most common tree species observed in our dataset.
# Filter the dataset ('df_api' DataFrame) to select specific tree species by their common names.
# .copy() creates an independent DataFrame, so the assignment below does not raise a SettingWithCopyWarning.
df_top = df_api.loc[df_api['spc_common'].isin(['London planetree','honeylocust','Callery pear','pin oak','Norway maple','cherry','Japanese zelkova','littleleaf linden','ginkgo','Sophora'])].copy()
df_top['spc_common'] = df_top['spc_common'].str.lower()
# Create a cross-tabulation (crosstab) of tree species distribution in NYC boroughs using data from the 'df_top' DataFrame.
df_rapi1 = pd.crosstab(df_top.boroname, df_top.spc_common, margins= True)
df_rapi1
| spc_common | callery pear | cherry | ginkgo | honeylocust | japanese zelkova | littleleaf linden | london planetree | norway maple | pin oak | sophora | All |
|---|---|---|---|---|---|---|---|---|---|---|---|
| boroname | |||||||||||
| Bronx | 3395 | 2801 | 1759 | 6495 | 3066 | 2565 | 5193 | 2496 | 4280 | 1756 | 33806 |
| Brooklyn | 3740 | 2938 | 2337 | 6747 | 3998 | 3651 | 14816 | 3485 | 5046 | 2634 | 49392 |
| Manhattan | 3173 | 270 | 2810 | 6178 | 1580 | 1497 | 1789 | 137 | 2013 | 1968 | 21415 |
| Queens | 5174 | 3956 | 1780 | 5951 | 2678 | 3522 | 9082 | 7042 | 5812 | 1877 | 46874 |
| Staten Island | 11996 | 3112 | 555 | 2585 | 1459 | 1086 | 6039 | 2407 | 4286 | 498 | 34023 |
| All | 27478 | 13077 | 9241 | 27956 | 12781 | 12321 | 36919 | 15567 | 21437 | 8733 | 185510 |
# Create a heatmap visualization of the tree species distribution in NYC boroughs using the 'df_rapi1' DataFrame.
fig, ax = plt.subplots(figsize=(18,10))
sns.heatmap(df_rapi1, cmap="YlGnBu", annot=True, fmt="d", ax=ax)
<matplotlib.axes._subplots.AxesSubplot at 0x20d7ee06da0>
The graph above breaks down the agglomeration of NYC trees by boroughs, focusing on the top 10 tree species and a representative sample of 300 thousand observations. As mentioned before, the London Plane tree is the most common species in our sample, and it's predominantly found in Brooklyn, with nearly 15 thousand recorded observations. Following closely is the Callery Pear, mainly located in Staten Island, with around 12 thousand observations.
During the initial review of the dataset, we observed that certain observations had exceptionally large trunk diameters (DBH). To address this, we conducted a basic outlier detection procedure and removed trees with a DBH of 60 cm or more.
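The fixed 60 cm cutoff used below can also be motivated by the classic 1.5 × IQR rule. The sketch below is a hedged illustration with toy values standing in for `df_mean['tree_dbh']`, not the report's actual procedure:

```python
import pandas as pd

# Toy DBH values standing in for df_mean['tree_dbh']; 450 mimics the
# implausibly large trunks seen in the raw census data.
dbh = pd.Series([8, 10, 12, 13, 15, 18, 20, 22, 450], dtype=float)

# The classic 1.5 * IQR rule, an alternative to a fixed cutoff:
# anything above Q3 + 1.5 * (Q3 - Q1) is flagged as an outlier.
q1, q3 = dbh.quantile(0.25), dbh.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
cleaned = dbh[dbh <= upper_fence]
```

On this toy series the fence lands at 32.0, so only the 450 value is dropped; on the real data the fence would be computed from the sample's own quartiles rather than hard-coded.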
df_api.index
RangeIndex(start=0, stop=300000, step=1)
# Filter the dataset ('df_api' DataFrame) to include only specific tree species of interest (based on the heatmap).
# .copy() creates an independent DataFrame so later filtering does not trigger a SettingWithCopyWarning.
df_mean = df_api.loc[df_api['spc_common'].isin(['London planetree','honeylocust','Callery pear','pin oak','Norway maple'])].copy()
df_mean.head(5)
| address | boroname | curb_loc | guards | health | latitude | longitude | nta_name | problems | spc_common | spc_latin | state | status | tree_dbh | tree_id | trunk_wire | x_sp | y_sp | zip_city | zipcode | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 147-074 7 AVENUE | Queens | OnCurb | None | Fair | 40.794111 | -73.818679 | Whitestone | Stones | pin oak | Quercus palustris | New York | Alive | 21 | 200540 | No | 1.034456e+06 | 228644.8374 | Whitestone | 11357 |
| 2 | 390 MORGAN AVENUE | Brooklyn | OnCurb | None | Good | 40.717581 | -73.936608 | East Williamsburg | None | honeylocust | Gleditsia triacanthos var. inermis | New York | Alive | 3 | 204026 | No | 1.001823e+06 | 200716.8913 | Brooklyn | 11211 |
| 3 | 1027 GRAND STREET | Brooklyn | OnCurb | None | Good | 40.713537 | -73.934456 | East Williamsburg | Stones | honeylocust | Gleditsia triacanthos var. inermis | New York | Alive | 10 | 204337 | No | 1.002420e+06 | 199244.2531 | Brooklyn | 11211 |
| 5 | 8 COLUMBUS AVENUE | Manhattan | OnCurb | Helpful | Good | 40.770046 | -73.984950 | Lincoln Square | None | honeylocust | Gleditsia triacanthos var. inermis | New York | Alive | 11 | 190422 | No | 9.884187e+05 | 219825.5227 | New York | 10023 |
| 6 | 120 WEST 60 STREET | Manhattan | OnCurb | Helpful | Good | 40.770210 | -73.985338 | Lincoln Square | None | honeylocust | Gleditsia triacanthos var. inermis | New York | Alive | 11 | 190426 | No | 9.883112e+05 | 219885.2785 | New York | 10023 |
# Create a boxplot to visualize the distribution of tree diameters (DBH) for the selected tree species.
plt.figure(figsize=(18, 6))
sns.boxplot(x=df_mean['tree_dbh'])
<matplotlib.axes._subplots.AxesSubplot at 0x20d074d4eb8>
As we can see from the graphical representation of the outliers based on tree diameter at breast height (DBH), the population mean is approximately 13 cm. The majority of the population, which falls within the interquartile range (between Q1 and Q3), is concentrated between 8 and 20 cm in diameter. This indicates the typical range for tree diameters within the dataset.
# Create a scatter plot to explore the relationship between tree species (SPC Common name) and their sizes (Tree Size).
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df_mean['spc_common'], df_mean['tree_dbh'])
ax.set_xlabel('SPC Common name')
ax.set_ylabel('Tree Size')
plt.show()
#This scatter plot helps us examine the distribution of tree sizes for different species, allowing us to observe any patterns or variations in tree size based on their common names.
# Remove tree observations with a DBH of 60 cm or greater from the 'df_mean' DataFrame.
# Reassigning a boolean-filtered frame avoids mutating a slice in place (and the associated SettingWithCopyWarning).
df_mean = df_mean[df_mean['tree_dbh'] < 60]
df_mean.describe()
| latitude | longitude | tree_dbh | x_sp | y_sp | zipcode | |
|---|---|---|---|---|---|---|
| count | 129319.000000 | 129319.000000 | 129319.00000 | 1.293190e+05 | 129319.000000 | 129319.000000 |
| mean | 40.696850 | -73.946267 | 14.58699 | 9.991154e+05 | 193188.791780 | 10804.477749 |
| std | 0.100946 | 0.129126 | 8.65722 | 3.582568e+04 | 36775.042434 | 610.274034 |
| min | 40.498510 | -74.254432 | 0.00000 | 9.134969e+05 | 120989.630100 | 83.000000 |
| 25% | 40.613574 | -73.994756 | 8.00000 | 9.857030e+05 | 162829.314000 | 10312.000000 |
| 50% | 40.692852 | -73.930817 | 13.00000 | 1.003426e+06 | 191746.680200 | 11101.000000 |
| 75% | 40.768642 | -73.858428 | 20.00000 | 1.023457e+06 | 219343.937550 | 11235.000000 |
| max | 40.910883 | -73.700488 | 59.00000 | 1.067248e+06 | 271154.286500 | 11694.000000 |
Removing outliers with a DBH greater than 60 cm results in a more representative and improved distribution of the tree population's DBH sizes. This helps ensure that extreme values do not disproportionately impact the analysis, allowing for a more accurate understanding of the typical tree sizes in the dataset.
# Create a scatter plot to explore the relationship between tree species (SPC Common name) and their sizes (Tree Size) with outliers removed.
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(df_mean['spc_common'], df_mean['tree_dbh'])
ax.set_xlabel('SPC Common name')
ax.set_ylabel('Tree Size')
plt.show()
# Given the nature of the data, we used a violin plot to visualize the distribution of tree sizes for the five most common species, with some specific visual settings.
sns.set(style="whitegrid")
df = df_mean
# Create the violin plot with customized visual settings.
f, ax = plt.subplots(figsize=(18, 12))
sns.violinplot(x="spc_common",y="tree_dbh",data=df, inner="box", palette="Greens", bw =.3, cut=.5, linewidth=3)
sns.despine(left=True)
# Set the plot title and axis labels.
f.suptitle('Tree Size Distribution (5 most common)', fontsize=14, fontweight='bold')
ax.set_xlabel("Tree's Common Name",size = 16,alpha=0.7)
ax.set_ylabel("Tree diameter measured at 4.5 feet above the ground ",size = 16,alpha=0.7)
# Resourses Used:
# https://mode.com/blog/violin-plot-examples
Text(0, 0.5, 'Tree diameter measured at 4.5 feet above the ground ')
Based on the information from the provided resources, the London plane tree is the largest among the representative sample. It can grow to a height of 20–30 meters (66–98 feet) and exceptionally even taller, with a trunk that can reach a circumference of 3 meters (10 feet) or more. On the other hand, the Callery Pear, also known as Pyrus calleryana, is the smallest, typically reaching a height of 5 to 8 meters (16 to 26 feet) with a conical to rounded crown.
The graph above depicts the overall diameter at breast height (DBH) size distribution of the top five species. It aligns with the information from the resources, highlighting that the London plane tree is the largest and tallest species in our representative sample.
#df_api.groupby('boroname')['health'].unique()
#df_api.groupby('boroname')['health'].count()
# Group the 'df_api' DataFrame by borough and health, then count the occurrences of each combination.
df_h = df_api.groupby(['boroname', 'health']).size().reset_index(name='counts')
df_h
| boroname | health | counts | |
|---|---|---|---|
| 0 | Bronx | Fair | 6847 |
| 1 | Bronx | Good | 44362 |
| 2 | Bronx | Poor | 2163 |
| 3 | Brooklyn | Fair | 11053 |
| 4 | Brooklyn | Good | 58229 |
| 5 | Brooklyn | Poor | 3528 |
| 6 | Manhattan | Fair | 5361 |
| 7 | Manhattan | Good | 21181 |
| 8 | Manhattan | Poor | 2039 |
| 9 | Queens | Fair | 12920 |
| 10 | Queens | Good | 54994 |
| 11 | Queens | Poor | 4143 |
| 12 | Staten Island | Fair | 7193 |
| 13 | Staten Island | Good | 49506 |
| 14 | Staten Island | Poor | 2081 |
df_h.dtypes
boroname    object
health      object
counts       int64
dtype: object
# Set Plotly credentials with your username and API key.
plotly.tools.set_credentials_file(username='XXX', api_key='XXX')
# Initialize Plotly for notebook usage.
offline.init_notebook_mode(connected=True)
# Import the required modules from Plotly.
import plotly.offline as pyo
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
df_zz = pd.crosstab(df_api.boroname, df_api.health, margins= True)
df_zz.dtypes
health
Fair    int64
Good    int64
Poor    int64
All     int64
dtype: object
#df_zz = df_zz.drop(columns="All")
#df_zz = df_zz.drop("All", axis=0)
df_plotly = df_zz
df_plotly
| health | Fair | Good | Poor | All |
|---|---|---|---|---|
| boroname | ||||
| Bronx | 6847 | 44362 | 2163 | 53372 |
| Brooklyn | 11053 | 58229 | 3528 | 72810 |
| Manhattan | 5361 | 21181 | 2039 | 28581 |
| Queens | 12920 | 54994 | 4143 | 72057 |
| Staten Island | 7193 | 49506 | 2081 | 58780 |
| All | 43374 | 228272 | 13954 | 285600 |
df_plotly = df_plotly.drop("All", axis=0)
# Define three Bar traces for 'Good', 'Fair', and 'Poor' health categories.
trace0 = go.Bar(
x = df_plotly.index,
y = df_plotly.Good,
marker=dict(
color='#3ACB5B'),
name='Good'
)
trace1 = go.Bar(
x = df_plotly.index,
y = df_plotly.Fair,
marker=dict(
color='#609E64'),
name='Fair'
)
trace2 = go.Bar(
x = df_plotly.index,
y = df_plotly.Poor,
marker=dict(
color='#B97D66'),
name='Poor'
)
# Exploratory Scatter traces, one per borough; note these are not included in the final stacked-bar figure below.
traces = [go.Scatter(
    x = df_plotly.columns,
    y = df_plotly.loc[rowname],
    mode = 'markers+lines',
    name = rowname
) for rowname in df_plotly.index]
# Define the data and layout for the Plotly figure
data = [trace0, trace1, trace2]
layout = go.Layout(
barmode='stack', title='Trees Health Status by Borough')
fig = go.Figure(data=data, layout=layout)
# Display the figure using Plotly.
pyo.offline.iplot(fig, filename='stacked-bar')
#Source retrieved from: https://plot.ly/python/bar-charts/
The graph above provides an overview of the recorded health status of the tree population. It was generated with the Plotly library and supports interactive exploration of the stacked bars. Based on the summary output of the cross-tabulation, approximately 80%, or about 228,000 trees, are in presumed 'Good' condition. The representative sample consists of 300,000 observations, and this conclusion is drawn from roughly 285,600 of them, as the rest were dropped due to missing data.
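The ~80% 'Good' figure can be read off directly by normalizing the cross-tabulation instead of dividing counts by hand. A minimal sketch, with a toy frame standing in for `df_api`'s `boroname`/`health` columns (so the 0.6 share below is illustrative, not the report's 0.8):

```python
import pandas as pd

# Toy health records standing in for df_api's 'boroname'/'health' columns.
df_small = pd.DataFrame({
    "boroname": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
    "health":   ["Good", "Good", "Poor", "Good", "Fair"],
})

# normalize=True turns the cross-tabulation into shares of the grand
# total; the 'All' margin row then gives each category's overall share.
health_share = pd.crosstab(df_small.boroname, df_small.health,
                           normalize=True, margins=True)
```

Applied to the full sample, `health_share.loc['All', 'Good']` would report the ~0.80 share quoted above.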
# Importing the Gmaps Library and Datasets
import gmaps
import gmaps.datasets
# Randomly sample 10,000 rows from the 'df_api' DataFrame and display the first few rows.
random_api = df_api.sample(n=10000)
print(random_api.head())
address boroname curb_loc guards health \
209309 1337 EAST 29 STREET Brooklyn OnCurb None Good
16903 300 WEST SERVICE ROAD Staten Island OnCurb None Poor
115962 2575 FISH AVENUE Bronx OnCurb NaN NaN
294086 74 ADLAI CIRCLE Staten Island OnCurb None Good
117097 373 11 STREET Brooklyn OnCurb None Good
latitude longitude nta_name \
209309 40.618539 -73.946270 Flatlands
16903 40.585771 -74.193796 New Springville-Bloomfield-Travis
115962 40.864909 -73.847263 Allerton-Pelham Gardens
294086 40.544407 -74.173142 Annadale-Huguenot-Prince's Bay-Eltingville
117097 40.667320 -73.985843 Park Slope-Gowanus
problems spc_common spc_latin state status \
209309 None purple-leaf plum Prunus cerasifera New York Alive
16903 BranchLights littleleaf linden Tilia cordata New York Alive
115962 NaN NaN NaN New York Stump
294086 None pin oak Quercus palustris New York Alive
117097 None sawtooth oak Quercus acutissima New York Alive
tree_dbh tree_id trunk_wire x_sp y_sp zip_city \
209309 8 418697 No 9.991665e+05 164631.2653 Brooklyn
16903 8 197701 No 9.304220e+05 152748.1578 Staten Island
115962 0 330277 No 1.026496e+06 254424.0739 Bronx
294086 17 507626 No 9.361291e+05 137666.1968 Staten Island
117097 3 299202 No 9.881774e+05 182399.2681 Brooklyn
zipcode
209309 11210
16903 10314
115962 10469
294086 10312
117097 11215
# Configure gmaps with your personal API key (redacted here)
gmaps.configure(api_key="XXX")
# Create a hybrid map centered at specific coordinates with custom layout settings
fig = gmaps.figure(map_type="HYBRID", center=(40.706742, -73.816493), zoom_level=10,
layout={
'width': '1200px',
'height': '600px',
'padding': '3px',
'border': '1px solid black'
})
# Retrieve location data from the 'df_api' DataFrame
locations = df_api[['latitude', 'longitude']]
# Create a heatmap layer with customized settings
heatmap = gmaps.heatmap_layer(locations)
heatmap.max_intensity = 2
heatmap.point_radius = 2
heatmap.gradient = ['white', 'green']
# Add the heatmap layer to the map
fig.add_layer(heatmap)
fig
#Source retrieved from: https://jupyter-gmaps.readthedocs.io/en/latest/api.html
Figure(layout=FigureLayout(border='1px solid black', height='600px', padding='3px', width='1200px'))
Please make sure to place the HTML file and the included MapLocations.png in the same folder before opening the HTML file. Thank you.
from IPython.display import Image
image_path = "C:\\dev\\a\\notebooks\\MapLocations.png"
# Display the image
Image(filename=image_path)
To plot the coordinates of the NYC tree census data, two different techniques were chosen: the Google Maps (gmaps) and Seaborn libraries. The gmaps map can only be visualized at runtime, so to preserve the output in the exported HTML report we took a screenshot and embedded the image in the notebook, as shown above. Additionally, a Seaborn plot was created from a sample of 10 thousand observations, providing an accurate representation of the five boroughs of NYC.
tree_df = df_api
# Create a scatterplot to visualize tree locations based on coordinates
plt.figure(figsize=[20,10])
sns.scatterplot(x='longitude', y='latitude', hue='boroname', palette="YlGnBu", data=tree_df.sample(10000))  # hue makes the palette take effect, coloring points by borough
# Plot title
plt.title('Tree location by coordinates in seaborn');
#retrieved from: https://seaborn.pydata.org/introduction.html
# Create a throwaway database in RAM (for illustration; the connection is immediately replaced below)
connection = sqlite3.connect(':memory:')
# Create or open a file called mySQLiteDB.sl3 containing a SQLite3 database
connection = sqlite3.connect('../datasets/mySQLiteDB.sl3')
cursor = connection.cursor()
cursor.execute("DROP TABLE IF EXISTS df_api")
df_api.to_sql(name='df_api', con=connection)
#Code Retrieved from: https://stackoverflow.com/questions/14431646/how-to-write-pandas-dataframe-to-sqlite-with-index
# Test Connection
df_sql = pd.read_sql('select spc_common from df_api', connection)
# Valid connection results
df_sql.head(10)
| spc_common | |
|---|---|
| 0 | red maple |
| 1 | pin oak |
| 2 | honeylocust |
| 3 | honeylocust |
| 4 | American linden |
| 5 | honeylocust |
| 6 | honeylocust |
| 7 | American linden |
| 8 | honeylocust |
| 9 | London planetree |
# Writing and storing dataset to an HDF5 file.
df_plotly.to_hdf('data.h5', key='df_plotly', mode='w')
The exercise above demonstrates data persistence in two formats, SQLite and HDF5.
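A quick round-trip check confirms that the persisted data survives storage intact. The sketch below uses an in-memory SQLite database and a toy frame standing in for `df_api` (the report's own database lives in mySQLiteDB.sl3):

```python
import sqlite3
import pandas as pd

# Toy frame standing in for df_api before persistence.
df_small = pd.DataFrame({
    "tree_id": [1, 2, 3],
    "spc_common": ["red maple", "pin oak", "honeylocust"],
})

# Persist to an in-memory SQLite database and read it straight back;
# comparing the round trip confirms nothing was lost in storage.
conn = sqlite3.connect(":memory:")
df_small.to_sql("df_api", conn, index=False)
restored = pd.read_sql("SELECT tree_id, spc_common FROM df_api", conn)
conn.close()
```

The same pattern (write, read back, compare) applies equally to the HDF5 file written with `to_hdf`/`read_hdf`.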
The datasets we've worked with in this project offer a great chance to showcase a comprehensive data analysis workflow, covering data acquisition, data cleaning, data merging, data storage, and data analysis, all using Python. The variety of variables in the dataset allows for insights from multiple perspectives.
I truly admire how much work has gone into collecting and maintaining this data. Given the ongoing nature of the Tree Census project, our analysis can serve as an interesting starting point for anyone looking to explore and understand New York City's tree population.